Introduction

Escherichia coli is a Gram-negative bacterium and the most widely studied bacterial model organism, for reasons that include its fast growth on cheap media and the fact that it can be genetically manipulated and modified far more easily than eukaryotes such as yeast or animals, or even than other bacteria. For an organism as well studied as E. coli, the amount of available data can be massive; for this reason we would like to perform a bioinformatic analysis to extract knowledge from one of the many RNA-seq datasets available for E. coli.

This project aims to establish a proxy for the activity of transcription factors based on the expression levels of the genes they regulate. In prokaryotes, genes with related functions are encoded together within the same genomic block, called an operon, and they are usually transcribed together in a so-called polycistronic transcript because they are under the control of the same promoter.

The null hypothesis posits that the abundance of transcripts serves as a reliable proxy for activity of the regulator. However, this assumption may not hold for all transcription factors: some factors are active only when phosphorylated, meaning that their transcripts may be constitutively expressed while their activity depends on phosphorylation, correlating instead with the activity of the corresponding kinase.

We will also focus on developing a general model relating transcription factor activity to transcript abundance. Significant deviations from this model could indicate transcription factors whose activity is not accurately reflected by transcript levels. One challenge of this approach is that genes of the same operon may be highly co-expressed, which is a problem for many models, such as linear regression; we will try to address this during our analysis.

Table of Contents

Here are the steps that we performed in our analysis:

  1. Introduction

  2. Library loading

  3. Pre-processing

  4. Exploratory Analysis

  5. Modelling - cAMP Receptor Protein (crp)

Library loading

library(readr)
library(matrixStats)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(caret)
library(glmnet)
library(igraph)
library(vip)
library(ggrepel)
library(rattle)
library(randomForestExplainer)

Pre-processing

Let’s start by importing the dataset from PreciseDB, an RNA-seq compendium for Escherichia coli from the Systems Biology Research Group at UC San Diego, which contains the expression levels of 3,923 E. coli genes across 278 different conditions. The authors cultured bacteria on many different media and measured gene expression; they then knocked out many different genes and cultured the mutants on the same media to measure expression again, expecting variations in expression. For example, knocking out a gene for iron metabolism on an iron-containing medium will certainly affect the expression profile of the culture, because the organism is no longer able to manage iron correctly.

# Import file 
log_tpm <- read.csv("log_tpm_full.csv", row.names = 1)
log_tpm

Next we want to perform some pre-processing. First of all we want to exclude two kinds of entries: knockout conditions and gene isoforms.

Given the nature of our analysis, we are not interested in the conditions in which a gene is knocked out: the knocked-out gene may have a huge influence on its related pathways, so we excluded these conditions to keep the results unbiased. The PreciseDB researchers also collected expression data for many gene isoforms, but the collection of isoforms was incomplete and there was no reliable way to decide which isoform to consider: keeping all of them during the pre-processing would have produced unmappable bnumbers and duplicate genes, so we removed them from the dataset before proceeding.

# Exclude condition with knockout genes
log_tpm <- log_tpm[, -grep("_del", colnames(log_tpm))]

# Drop genes with isoform
genes_with_isoforms <- grep("_\\d+$", rownames(log_tpm), value = TRUE)

log_tpm <- subset(log_tpm, !(rownames(log_tpm) %in% genes_with_isoforms))
dim(log_tpm)
## [1] 4257  188

Now we can proceed to import the regulatory network from RegulonDB, which reports the target genes of each regulator. Each row of this dataset represents an edge in a regulatory network graph and contains information about the interaction, perhaps the most important piece being whether the interaction is positive or negative.

regulator <- read.table(file="tableDataReg.csv",
                        header=TRUE, 
                        sep=",")
regulator

We decided to eliminate the interactions whose regulator is ppGpp, as well as those annotated as weak ("W") or unknown ("?").

regulator <- regulator[which(regulator[, 2] != "ppGpp"), ]

w <- which(trimws(regulator[,7])=="W")
if(length(w)>0){
  regulator <- regulator[-w,]
}

w <- which(trimws(regulator[,7])=="?")
if(length(w)>0){
  regulator <- regulator[-w,]
}

nrow(regulator)
## [1] 4229

There is a discrepancy between the gene identifiers in the regulatory network obtained from RegulonDB and those in the PreciseDB dataset: the regulatory network uses gene names, while our dataset uses bnumbers. It is therefore better to convert the bnumbers to gene names to facilitate comparison and downstream analysis. In addition, we decided to remove the genes whose bnumber could not be mapped to a gene name, together with the duplicated gene insI2.

# Loading the file from ECOcyc which contain the mapping bnum-genename
map_bnum <- read.delim("mapbnum.txt", header = TRUE)

# Keep only the useful columns
map_bnum <- map_bnum[c("Gene.Name", "Accession.1")]

# Map between our dataset and the file of ecocyc
log_tpm$gene_number <- rownames(log_tpm)
log_tpm <- merge(log_tpm, map_bnum, by.x = "gene_number", by.y = "Accession.1", all.x = TRUE)

# Rearrange the dataset
log_tpm <- log_tpm[, c("Gene.Name", setdiff(names(log_tpm), "Gene.Name"))]

# Removing unmapped genes bnumber
log_tpm <- subset(log_tpm, !is.na(Gene.Name))

# Removing the duplicate gene insI2 (its expression values are also all 0)
log_tpm <- subset(log_tpm, !(log_tpm$Gene.Name == "insI2"))

# Setting rownames and dropping the first 2 columns
rownames(log_tpm) <- log_tpm$Gene.Name
log_tpm <- log_tpm[,3:ncol(log_tpm)]

# Transpose log_tpm in order to have genes as columns and conditions as rows
log_tpm <- t(log_tpm)

Exploratory Analysis

We would like to verify whether we can use one of the four major summary statistics of each gene’s expression, namely the mean, median, maximum and minimum, to model the relationship between the targets and their regulator. Let’s now analyze the distribution of these statistics to understand which value, if any, is the best choice; we also want to check whether they follow a Gaussian distribution using the Shapiro–Wilk test, which computes the W statistic as follows:

\[ W = \dfrac{\big(\sum^n _ {i=1} a_i x_{(i)} \big ) ^2}{\sum^n _ {i=1} (x_i - \bar{x})^2} \]

Then this W value is used for the hypothesis testing that follows:

\[ H_o: \text{a sample } x_1, \cdots, x_n \text{ is drawn from a normally distributed population.} \\ H_a: \text{a sample } x_1, \cdots, x_n \text{ is not drawn from a normally distributed population.} \]

If the test returns a p-value lower than the 0.05 threshold, we reject the null hypothesis and conclude that the data do not follow a normal distribution.
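As a quick illustration of how the test behaves, here is a base-R sketch on simulated samples (not our expression data):

```r
# Sketch: Shapiro-Wilk on a clearly normal vs. a clearly skewed sample
set.seed(42)
normal_sample <- rnorm(200)   # drawn from N(0, 1)
skewed_sample <- rexp(200)    # exponential, strongly right-skewed

p_normal <- shapiro.test(normal_sample)$p.value  # typically large: no evidence against normality
p_skewed <- shapiro.test(skewed_sample)$p.value  # tiny: normality rejected
```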

# Histograms of Summary Statistics

# MEAN
log_tpm_mean <- data.frame(value = apply(log_tpm, 2, mean))
mean_hist <- ggplot(log_tpm_mean, aes(x = value)) +
                    geom_histogram(binwidth = 0.5, fill = "skyblue", 
                                   color = "black") +
                    labs(x = "Mean log-TPM", y = "Frequency") +
                    theme_minimal() +
                    theme(panel.grid.major = element_blank(), 
                          panel.grid.minor = element_blank())

# MEDIAN
log_tpm_median <- data.frame(value = apply(log_tpm, 2, median))
median_hist <- ggplot(log_tpm_median, aes(x = value)) +
                      geom_histogram(binwidth = 0.5, fill = "lightgreen", 
                                     color = "black") +
                      labs(x = "Median log-TPM", y = "Frequency") +
                      theme_minimal() +
                      theme(panel.grid.major = element_blank(), 
                            panel.grid.minor = element_blank())

# MAX
log_tpm_max <- data.frame(value = apply(log_tpm, 2, max))
max_hist <- ggplot(log_tpm_max, aes(x = value)) +
                   geom_histogram(binwidth = 0.5, fill = "lavender", 
                                  color = "black") +
                   labs(x = "Max log-TPM", y = "Frequency") +
                   theme_minimal() +
                   theme(panel.grid.major = element_blank(), 
                         panel.grid.minor = element_blank())

# MIN
log_tpm_min <- data.frame(value = apply(log_tpm, 2, min))
min_hist <- ggplot(log_tpm_min, aes(x = value)) +
                   geom_histogram(binwidth = 0.5, fill = "lightpink", 
                                  color = "black") +
                   labs(x = "Min log-TPM", y = "Frequency") +
                   theme_minimal() +
                   theme(panel.grid.major = element_blank(), 
                         panel.grid.minor = element_blank())

# Final Plot
grid.arrange(mean_hist, median_hist, max_hist, min_hist, nrow = 2, ncol = 2,
             top = textGrob("Histograms of Summary Statistics", 
                            gp=gpar(fontsize=16)))

#MEAN
shapiro.test(log_tpm_mean$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  log_tpm_mean$value
## W = 0.98717, p-value < 2.2e-16
#MEDIAN
shapiro.test(log_tpm_median$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  log_tpm_median$value
## W = 0.98578, p-value < 2.2e-16
#MAXIMUM
shapiro.test(log_tpm_max$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  log_tpm_max$value
## W = 0.99611, p-value = 3.889e-09
#MINIMUM
shapiro.test(log_tpm_min$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  log_tpm_min$value
## W = 0.91615, p-value < 2.2e-16

Based on the results of the Shapiro-Wilk tests conducted on the four distributions and their respective histograms, we can conclude that none of them follows a Gaussian (normal) distribution.
Consequently, using the mean, median, maximum or minimum as a predictor would violate the linear regression assumption that the data follow a Gaussian distribution. The biggest problem, however, is that each of these statistics collapses the data to a single value per gene, so we would effectively be training a model on a single sample, which makes no sense. In the next section on modelling we will instead take, in every condition, the mean over all the targets, and fit a model with one feature (the mean of the targets) and as many samples as we have conditions.

Now we plot some histograms to get a better idea of the number of positive, negative and total interactions of each regulator, and again we perform a Shapiro–Wilk test on each of them to check for normality.

# Divide positive and negative regulation
positive_reg <- regulator[regulator$X6.function == "+",]
negative_reg <- regulator[regulator$X6.function == "-",]

# Identify the different regulators in the dataset
unique_regulators <- unique(regulator$X3.RegulatorGeneName)

pos_counts <- c()
neg_counts <- c()

# Count how many genes are regulated by each regulator
for(reg in unique_regulators){
  pos_counts <- c(pos_counts, 
                  nrow(positive_reg[positive_reg$X3.RegulatorGeneName == reg,]))
  neg_counts <- c(neg_counts, 
                  nrow(negative_reg[negative_reg$X3.RegulatorGeneName == reg,]))
}

pos_counts <- unlist(pos_counts)
names(pos_counts) <- unique_regulators

neg_counts <- unlist(neg_counts)
names(neg_counts) <- unique_regulators

pos_counts <- data.frame(value = pos_counts)
neg_counts <- data.frame(value = neg_counts)
total_counts <- data.frame(value = pos_counts$value + neg_counts$value)

# Histograms of how many genes are regulated by each gene

# Positive Counts Histogram
pos_counts_hist <- ggplot(pos_counts, aes(x = value)) +
                          geom_histogram(binwidth = 10, fill = "skyblue", 
                                         color = "black") +
                          labs(x = "Positive Regulations Count", y = "Frequency") +
                          theme_minimal() +
                          theme(panel.grid.major = element_blank(), 
                                panel.grid.minor = element_blank())

# Negative Counts Histogram
neg_counts_hist <- ggplot(neg_counts, aes(x = value)) +
                          geom_histogram(binwidth = 10, fill = "lightgreen", 
                                         color = "black") +
                          labs(x = "Negative Regulations Count", y = "Frequency") +
                          theme_minimal() +
                          theme(panel.grid.major = element_blank(), 
                                panel.grid.minor = element_blank())

# Total Counts Histogram
total_counts_hist <- ggplot(total_counts, aes(x = value)) +
                            geom_histogram(binwidth = 10, fill = "lightpink", 
                                           color = "black") +
                            labs(x = "Total Regulations Count", y = "Frequency") +
                            theme_minimal() +
                            theme(panel.grid.major = element_blank(), 
                                  panel.grid.minor = element_blank())

# Final Plot
grid.arrange(pos_counts_hist, neg_counts_hist, total_counts_hist, ncol = 1)

# Positive Interactions Counts normality test
shapiro.test(pos_counts$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  pos_counts$value
## W = 0.29429, p-value < 2.2e-16
# Negative Interactions Counts normality test
shapiro.test(neg_counts$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  neg_counts$value
## W = 0.25872, p-value < 2.2e-16
# Total Interactions Counts normality test
shapiro.test(total_counts$value) #not gaussian
## 
##  Shapiro-Wilk normality test
## 
## data:  total_counts$value
## W = 0.29589, p-value < 2.2e-16

These histograms show how many genes are regulated by each regulator: the horizontal axis indicates the number of genes controlled by a regulator, and the vertical axis indicates how many regulators have that number of interactions. For example, more than 150 regulators positively interact with 0 genes, which is not surprising because we removed weak and unknown interactions; we can also see that one regulator negatively controls close to 400 genes. Slightly more regulators interact negatively with at least one gene than positively. In addition, the majority of regulators control few genes, with some exceptions such as crp, which positively interacts with 310 other genes. Finally, none of these distributions is normally distributed.

Network analysis

# Creating the edge list to plot the network
edge_list <- cbind(RegulatorName = regulator$X3.RegulatorGeneName, 
                   Target = regulator$X5.regulatedName)
graph <- graph_from_edgelist(edge_list)
layout <- layout_with_fr(graph, niter=100)


# Visualization of the network
plot.igraph(graph, layout = layout, vertex.label = NA, vertex.size = 3,
            edge.arrow.size = 0.2, edge.curved = TRUE,
            main = "E.coli Network", xlim = c(-0.5, 0.5), ylim = c(-1, 1))
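Beyond plotting, the edge list also lets us count how many targets each regulator controls (its out-degree), which is where hubs like crp stand out. A base-R sketch on a toy edge list (the gene pairs below are hypothetical; the real edges come from RegulonDB):

```r
# Out-degree of each regulator, computed directly from an edge list
toy_edges <- data.frame(
  RegulatorName = c("crp", "crp", "crp", "fnr", "fnr", "arcA"),
  Target        = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF")
)

# table() counts how many edges leave each regulator; sort to rank hubs
out_degree <- sort(table(toy_edges$RegulatorName), decreasing = TRUE)
out_degree  # crp tops this toy ranking, consistent with its global-regulator role
```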

Modelling

In order to find a mathematical model able to describe the complexity of this biological problem, we would like to start with simpler cases, look for the best approach for them, and then apply that approach to each regulator. For this reason we will perform the modelling individually on the positive interactions of two genes, to see which model is suitable to be applied to all the genes; we chose the gene crp and a second gene (still to be chosen).

cAMP Receptor Protein (crp)

The protein CRP is a global transcription regulator that plays a major role in carbon catabolite repression (CCR), as well as in other processes, because it can act as an activator, repressor, coactivator or corepressor. CRP binds cyclic AMP (cAMP), which allosterically activates it, meaning that its activity should not be influenced by the activity of any kinase; once activated, it binds its DNA binding sites to directly regulate the transcription of about 300 genes in about 200 operons and indirectly regulate the expression of about half the genome.

Simple Linear Regression

The first thing we want to try, as anticipated in the previous section, is taking the mean over all the targets in every condition and fitting a linear regression model with this single feature. Let’s start by extracting the expression of crp and the expression of its targets.

# Expression of crp
crp_exp <- log_tpm[,colnames(log_tpm) == "crp"]

# List of positively regulated targets by crp
crp_target <- positive_reg[positive_reg$X3.RegulatorGeneName == "crp",]

# Expression of the targets
crp_target_exp <- log_tpm[,colnames(log_tpm) %in% crp_target$X5.regulatedName]

Now we take the simple mean over all the target genes within each condition; in this way we obtain as many means as there are samples (188), and we fit a linear regression model using this mean as the sole feature. Then we assemble a dataframe with all the data needed.

# Apply the mean to each row of the dataset (mean of each condition)
crp_target_mean_conditions <- apply(crp_target_exp, 1, mean)

# Creating the training dataframe
crp_mean_train <- data.frame(crp = unlist(crp_exp), 
                             targets = unlist(crp_target_mean_conditions))

All the models fitted in this project will be validated with 10-fold cross-validation. Cross-validation is a simple and widely used resampling technique for estimating prediction error, i.e. for assessing how the model would behave on unseen data. It is often used when there is not enough data to set aside a validation set, and it has several advantages: it helps prevent overfitting by providing a more robust estimate of the model’s performance, and it allows model selection and hyperparameter tuning while remaining data efficient, because all the data is used for both training and testing. In the case of linear regression the purpose of cross-validation is only to get a robust estimate of performance, but in the next sections we will also use it for model selection and hyperparameter tuning. The rationale of 10-fold cross-validation is very simple: the dataset is divided into 10 folds, 9 are used for training and the last one for testing, and the procedure is repeated nine more times, changing the testing fold each time and training on the remaining folds.
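The procedure described above can be sketched in a few lines of base R on simulated data (illustrative only; in our analysis caret handles this for us):

```r
# Minimal 10-fold cross-validation of a linear model on simulated data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)          # hypothetical linear relationship plus noise
d <- data.frame(x = x, y = y)

folds <- sample(rep(1:10, length.out = n))  # random fold assignment
rmse  <- numeric(10)
for (k in 1:10) {
  train_set <- d[folds != k, ]              # 9 folds for training
  test_set  <- d[folds == k, ]              # 1 fold held out for testing
  fit  <- lm(y ~ x, data = train_set)
  pred <- predict(fit, newdata = test_set)
  rmse[k] <- sqrt(mean((test_set$y - pred)^2))
}
mean(rmse)  # average test error across the 10 folds
```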

At every iteration the procedure computes an error metric, and at the end it averages all the metrics to give us a more robust estimate of the real testing error. The caret package is the R counterpart of Python’s scikit-learn: it makes it very easy to set up the training of machine learning models and provides a convenient way to configure cross-validation for more reliable models!

## 10-fold CV
fit_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  ## repeated ten times
  repeats = 10)

Let’s get back to linear regression, using one feature we are in a setting for the use of simple linear regression which is a model that assumes that there is a linear relationship between \(X\) and \(Y\), in this case \(Y\) is the expression of our regulator crp and \(X\) is the mean of all its targets. The relationship between \(X\) and \(Y\) is modelled as follows:

\[Y \approx \beta_0 + \beta_1 X\]

To find a solution for this problem the Least Squares Method is employed: we know that \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\) is a prediction from our model and \(e_i = y_i - \hat{y_i}\) is the residual error, that is the difference between the actual value \(y_i\) and the predicted one \(\hat{y_i}\).
Least squares finds a solution that minimizes the residual sum of squares of all training data, RSS.

\[ RSS = e_1^2 + e_2^2 + \dots + e_N^2 = \sum_i (y_i - \hat{y_i})^2 = \sum_{i=1}^N (y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2 \]

Now we compute the partial derivatives of RSS with respect to the model parameters \(\hat{\beta_0}\) and \(\hat{\beta_1}\) and set them equal to 0. The RSS function is globally convex, meaning there exists a single stationary point, which is a minimum; for this reason we do not need to compute the second derivative to find the global minimum, and we can directly solve the resulting equations for the parameters:

\[\hat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}} \quad\quad\] \[\hat{\beta}_{0}=\overline{y}-\hat{\beta}_{1} \overline{x}\]

where \(\bar{x}\) is the mean of the per-condition means of expression of the crp targets and \(\bar{y}\) is the mean expression of crp.
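These closed-form estimators can be checked against R’s lm() on simulated data (a sketch; x and y here are hypothetical, not the crp data):

```r
# Hand-computed least-squares estimates versus lm()
set.seed(7)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50, sd = 0.2)   # hypothetical linear relationship

# Closed-form solutions from the formulas above
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)

fit <- lm(y ~ x)
c(beta0, beta1)   # hand-computed estimates
coef(fit)         # identical to lm()'s coefficients
```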

As anticipated before, fitting a linear regression with 10-fold cross-validation in caret requires only a few lines of code.

#cv is a random procedure so we set a random seed for reproducibility
set.seed(123)

crp_lm_mean <- train(crp ~ targets, 
                     data = crp_mean_train,
                     method = "lm",
                     trControl = fit_control)
crp_lm_mean
## Linear Regression 
## 
## 188 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 170, 168, 169, 171, 169, 169, ... 
## Resampling results:
## 
##   RMSE       Rsquared    MAE      
##   0.3720215  0.03536782  0.2895687
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The coefficient of determination, or \(R^2\), is a measure of the goodness of fit of a model. In other words, it tells us the fraction of the variance of the data explained by the model; it is computed with the following formula.

\[ R^2 = \dfrac{\sum^{n} _ {i = 1} \big(\hat{y_i} -\bar{y} \big)^2}{\sum^n _ {i = 1} (y_i -\bar{y})^2} \]

\(R^2\) values close to 1 tell us that the model explains almost 100% of the variance of the data, indicating that it fits the data very well, while values close to 0 tell the opposite. In this case we got \(R^2\) = 0.03536782, a very low value, which means the regression was not able to model the expression of crp using the mean of its targets. This result can be further explained by checking whether some assumption of linear regression is violated; in particular, we will now look at homoscedasticity and normality of the residuals.
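As a sanity check, \(R^2\) can be reproduced by hand from the formula above; a minimal base-R sketch on simulated data (not the crp data):

```r
# Reproducing R^2 by hand for a simple linear model
set.seed(99)
x <- rnorm(100)
y <- 3 * x + rnorm(100)   # hypothetical relationship plus noise
fit   <- lm(y ~ x)
y_hat <- fitted(fit)

# Explained sum of squares over total sum of squares
r2_manual <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)
c(r2_manual, summary(fit)$r.squared)  # the two values coincide
```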

# scatterplot of the data with the regression line
crp_lm_mean_scatterplot <- ggplot(crp_mean_train, aes(x = targets, y = crp)) +
                           ggtitle("Scatterplot of targets mean and crp") +
                           theme(plot.title = element_text(hjust = 0.5)) + 
                           xlab("Targets Mean in each Condition") + 
                           ylab("crp expression in log(TPM)") + 
                           geom_point(size = 2) + 
                           geom_abline(intercept = crp_lm_mean$finalModel$coefficients[[1]],
                                       slope = crp_lm_mean$finalModel$coefficients[[2]],
                                       color = "red",
                                       linewidth = 1)  

# Dataframe containing fitted values and the residuals
crp_res <- data.frame(fitted.values = crp_lm_mean$finalModel$fitted.values,
                      residuals = crp_lm_mean$finalModel$residuals)

# Residuals scatterplot
crp_lm_mean_residuals <-  ggplot(crp_res, aes(x = fitted.values, y = residuals)) +
                          ggtitle("Residuals plot") +
                          theme(plot.title = element_text(hjust = 0.5)) + 
                          xlab("Fitted Values") +
                          ylab("Residuals") + 
                          geom_point() +
                          geom_hline(yintercept = 0, color = "red")

# Final plot
grid.arrange(crp_lm_mean_scatterplot, crp_lm_mean_residuals, nrow = 1)

On the left you can see the scatterplot of the targets’ mean in each condition (independent variable) against the expression of crp in \(log(TPM)\) (dependent variable), with the regression line fitted by the model above; on the right you can see the scatterplot of the residuals. Just by looking at it we can notice that one assumption of linear regression is violated: the variance of the residuals is not constant, so the homoscedasticity assumption is not met. Now let’s check the normality of the residuals.

crp_lm_mean_qq <- ggplot(crp_res, aes(sample=residuals)) +
                  ggtitle("Residuals QQ-plot") +
                  theme(plot.title = element_text(hjust = 0.5)) +
                  xlab("Theoretical Quantile") +
                  ylab("Residuals Quantiles") + 
                  stat_qq() + 
                  stat_qq_line(color = "red")
crp_lm_mean_qq

shapiro.test(crp_res$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  crp_res$residuals
## W = 0.98452, p-value = 0.03616

By looking at the QQ-plot we can see that the points lie on the line for the most part but deviate at the tails, and with a p-value threshold of 0.05 the normality test lets us confidently say that the residuals are not normally distributed: another important assumption of linear regression is violated. Considering all the elements discussed above, we conclude that this simplistic model is not able to describe the complex relationship between a regulator and its targets.

Multivariate Linear Regression

Since only one variable was not enough to explain the relationship between the regulator and its targets we will now try to extend linear regression in order to include more variables in the model. Similarly to simple linear regression, we want to find the values for the model’s parameters that minimize the residual sum of squares. The mathematical formulation of a multivariate linear regression is the following:

\[\hat{y}=\hat{\beta}_{0}+\hat{\beta}_{1} x_{1}+\hat{\beta}_{2} x_{2}+\cdots+\hat{\beta}_{p} x_{p}\]

The least squares solution for this problem involves matrix multiplication and the so-called pseudo-inverse to compute the \({\hat\beta}\) coefficients; the model’s predictions \(\hat{y}\) are then obtained by multiplying the matrix of predictors \(X\) by the vector of coefficients \(\hat{\beta}\).

\(\hat{\beta}=\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \mathbf{y} \quad\quad\) with \(X\) a \(N \times (p+1)\) matrix, \(y\) a \(N \times 1\) vector

\(\hat{\mathbf{y}}=\mathbf{X} \hat{\beta}=\mathbf{X}\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \mathbf{y} \quad\quad\) with \(\hat{y}\) the predicted value for each sample
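These matrix formulas can be verified in base R on simulated data (a sketch; note that lm() itself uses a QR decomposition rather than explicitly inverting \(\mathbf{X}^T\mathbf{X}\), which is numerically safer):

```r
# Normal-equations solution versus lm() for multiple regression
set.seed(3)
n <- 60
X <- cbind(1, matrix(rnorm(n * 3), n, 3))   # intercept column + 3 predictors
beta_true <- c(1, 2, -1, 0.5)               # hypothetical true coefficients
y <- X %*% beta_true + rnorm(n, sd = 0.3)

# beta_hat = (X'X)^{-1} X'y, exactly as in the formula above
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(lm(y ~ X[, -1])))      # matching estimates
```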

The first step is creating a dataframe containing the expression of crp and of all its targets.

crp_exp <- data.frame(crp = crp_exp)
crp_full <- data.frame(cbind(crp_exp, crp_target_exp))

Then we train a model using all the target genes as predictors for crp expression with no further pre-processing and we cross-validate it.

set.seed(123)
crp_lm <- train(crp ~., 
                data = crp_full,
                method = "lm",
                trControl = fit_control)
crp_lm
## Linear Regression 
## 
## 188 samples
## 300 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 170, 168, 169, 171, 169, 169, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   9.858278  0.09793242  7.807553
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

As you can see, the \(R^2\) of this model is still not at all satisfactory and the RMSE is also very high: the Root Mean Square Error (RMSE) is a metric that describes the average distance between the predicted values and the real values

\[ RMSE = \sqrt{\dfrac{\sum^{n} _ {i = 1} \big(y_i - \hat{y}_i \big)^2}{n}} \]

so an RMSE of 9.858278 is very high in the context of gene expression, since we are working on a logarithmic scale. The reason this model is so bad is revealed by a caret warning (here suppressed): many \(\beta\) coefficients could not be estimated because our predictor variables are correlated, while an important assumption of linear regression is the independence of the predictors. This strong correlation between our variables is expected, because they are all genes controlled by the same regulator crp, so it makes sense that they are correlated; indeed, it is one of the reasons for this analysis. To remove this multicollinearity we will try two main approaches: feature selection and dimensionality reduction.
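For concreteness, the RMSE reported by caret can be computed directly from predictions; a base-R sketch on simulated values (not the crp model):

```r
# RMSE as the root of the mean squared prediction error
set.seed(5)
y      <- rnorm(30, mean = 5)        # hypothetical observed values
y_pred <- y + rnorm(30, sd = 0.5)    # hypothetical predictions with some error
rmse <- sqrt(mean((y - y_pred)^2))
rmse  # on the same scale as y, which is why 9.86 log-TPM units is so large
```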

Lasso Regression

The first approach we want to try is Lasso regression, which is essentially multivariate linear regression with the introduction of a penalization term that regularizes the estimates of the coefficients. This method belongs to the family of shrinkage methods, and its main advantage is that it can force some of the coefficients of the model to be exactly 0, which makes it an effective way to perform feature selection. For this purpose Lasso minimizes the following function:

\[ \sum^n _{i=1} (y_i - \beta_0 - \sum^p _{j = 1}\beta_jx_{ij})^2 + \lambda\sum^p_{j = 1}|\beta_j| \]

Here \(p\) is the number of predictors (the number of crp targets) and \(\lambda\) is a tunable hyperparameter that controls how much we can “spend” in terms of coefficients; the right value of \(\lambda\) forces some coefficients to 0 and thus performs feature selection. The best value of \(\lambda\) has to be found with cross-validation for hyperparameter selection; it is interesting to note that with \(\lambda = 0\) this minimization reduces to the plain least squares method. It is also important to scale our data, because Lasso is sensitive to the scale of the predictors; luckily caret does this very easily through the preProcess argument of the train function.
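The way the \(\ell_1\) penalty zeroes coefficients can be illustrated with the soft-thresholding operator, which gives the lasso solution in closed form in the special case of an orthonormal design; the coefficient values below are hypothetical, chosen only for illustration:

```r
# Soft-thresholding: each least-squares coefficient is shrunk toward zero
# by lambda and truncated at zero, which is how lasso performs selection
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

shrunk <- soft_threshold(c(-2.0, -0.3, 0.1, 1.5), lambda = 0.5)
shrunk  # returns -1.5, 0, 0, 1.0: the two small coefficients are set exactly to 0
```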

set.seed(123)
# Tuning grid for Lasso 
tune_grid_lasso <- expand.grid(alpha = 1, #tell the function to perform lasso
                               lambda = 10^(-5:5)) #values of lambda to try

# Lasso training, cv and hyperparameter tuning
crp_lasso <- train(crp ~., 
                   data = crp_full,
                   method = "glmnet",
                   preProcess = c("center", "scale"), #this basically transform in Z scores
                   trControl = fit_control,
                   tuneGrid = tune_grid_lasso)
crp_lasso
## glmnet 
## 
## 188 samples
## 300 predictors
## 
## Pre-processing: centered (300), scaled (300) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 170, 168, 169, 171, 169, 169, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   1e-05   0.1752410  0.7797536  0.1369286
##   1e-04   0.1752410  0.7797536  0.1369286
##   1e-03   0.1752410  0.7797536  0.1369286
##   1e-02   0.1952233  0.7359266  0.1538322
##   1e-01   0.3180020  0.4220927  0.2470862
##   1e+00   0.3705970        NaN  0.2885583
##   1e+01   0.3705970        NaN  0.2885583
##   1e+02   0.3705970        NaN  0.2885583
##   1e+03   0.3705970        NaN  0.2885583
##   1e+04   0.3705970        NaN  0.2885583
##   1e+05   0.3705970        NaN  0.2885583
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.001.

For the tuning of \(\lambda\) we tried powers of 10 from \(10^{-5}\) to \(10^5\). As you can see, for values greater than \(10^{-1}\) the model was not able to converge (caret gave a warning that we suppressed), for the same reasons discussed for linear regression, while for smaller values of \(\lambda\) the fits converged and reached a good \(R^2\).

cv_lasso_plot <-  ggplot(data = crp_lasso$results, aes(x = log10(lambda), y = RMSE, group=1)) +
                  ggtitle("Lasso Cross-Validation Results") +
                  theme(plot.title = element_text(hjust = 0.5)) +
                  xlab("log(λ)") + 
                  ylab("RMSE (Repeated Cross-Validation)") + 
                  geom_line(color="blue")+
                  geom_point()

cv_lasso_plot

Finally, with 10-fold cross-validation, as you can see from the plot above, we chose \(\lambda = 10^{-3}\) as the final hyperparameter; the RMSE of this model is also very acceptable if we reason in logarithmic scale in the context of gene expression. Let’s now take a look at the coefficients of the model

plot(crp_lasso$finalModel, xvar = "lambda", xlab= "log(λ)")

In this messy plot each line represents how the estimated coefficient \(\hat\beta\) of a gene changes with the value of \(\lambda\). As you can see, the majority of these lines reach 0 before \(\log(\lambda) = -3\), so those genes have their coefficient set to 0, while the remaining genes keep a non-zero coefficient.
Another interesting way of visualizing the coefficients and their importance is through the vip package. This package produces nice plots by ranking the variables with a model-specific importance metric; in the case of Lasso it takes the absolute values of the coefficients and ranks them in descending order.

vip(crp_lasso$finalModel, method = "model", num_features = 125)

It is very interesting to note that the first 121 gene coefficients are different from 0 while the remaining 179 gene coefficients are set to 0 by the Lasso procedure (only 4 of them are shown here for simplicity). Let’s now look at some of the most important genes and their biological function to see whether their importance makes sense:

  1. sucD: succinyl-CoA synthetase is a very important enzyme in the citric acid cycle, one of the most fundamental processes of every living organism, and thus it enables the survival of the cell. sucD specifically encodes the \(\alpha\) subunit, which binds CoA and phosphate, while the \(\beta\) subunit binds the succinate in order to catalyze the synthesis of succinyl-CoA. From a biological point of view it makes sense that this gene is very important because it is part of a basic biochemical process, and in the image above you can see that CRP, activated by cAMP, is involved in the transcription of the operon that contains sucD.
  2. rpsF: 30S ribosomal subunit protein S6 is, as the name says, a subunit of the ribosome, the macromolecular complex, composed of rRNA and proteins, that translates mRNA into protein. Again this is a very basic biological process that also involves protein synthesis, and since CRP itself is a protein it makes sense that the expression of this gene is able to explain the expression of crp.
  3. gatY: tagatose-1,6-bisphosphate aldolase 2 is important in the degradation of galactitol, a toxic alcohol that is a byproduct of galactose metabolism. Since galactose is a sugar used in many of the media on which Escherichia coli cultures are grown, it makes sense that a gene involved in the degradation of a toxic compound related to galactose is important for crp, a global transcription factor, because it contributes to the survival of the cell in its environment.

Now we know that Lasso may be a reliable method for modelling our targets-regulator relationship, but it is heavily dependent on the chosen value of \(\lambda\), which might differ for other regulators; when we analyze all the genes we will try to verify this possibility.
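As a quick sanity check on the selection, we can pull the coefficients of the final glmnet fit at the chosen \(\lambda\) and count how many survive; a minimal sketch, assuming the crp_lasso object trained above:

```r
# Coefficients of the final model at the lambda selected by cross-validation
lasso_coefs <- coef(crp_lasso$finalModel, s = crp_lasso$bestTune$lambda)

# Drop the intercept and keep only the genes with a non-zero coefficient
betas   <- lasso_coefs[-1, 1]
nonzero <- betas[betas != 0]

length(nonzero)                                  # number of selected genes
head(sort(abs(nonzero), decreasing = TRUE), 5)   # top genes by |beta|
```

The same vector could also be used to rebuild the vip ranking by hand, since for Lasso the importance is simply \(|\hat\beta|\).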

Principal Component Regression

Another way to handle many correlated predictors is Principal Component Regression (PCR): the predictors are first summarized into orthogonal principal components and the response is then regressed on the first ncomp of them, so the number of components becomes the hyperparameter to tune.

set.seed(123)

# Build tuning grid for Principal Component Regression
tune_grid_pcr <- expand.grid(ncomp = 1:((ncol(crp_full) - 1) %/% 2))

crp_pca <- train(crp ~., 
                 data = crp_full, 
                 method = "pcr",
                 trControl = fit_control,
                 preProcess = c("center", "scale", "zv"),
                 tuneGrid = tune_grid_pcr)

crp_pca
## Principal Component Analysis 
## 
## 188 samples
## 300 predictors
## 
## Pre-processing: centered (300), scaled (300) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 170, 168, 169, 171, 169, 169, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##     1    0.3718474  0.03440425  0.2897315
##     2    0.3747414  0.04815675  0.2915025
##     3    0.3409510  0.20124208  0.2643586
##     4    0.3372675  0.21590502  0.2593136
##     5    0.3353930  0.22660318  0.2571788
##     6    0.3330983  0.23736598  0.2548380
##     7    0.3304441  0.24752351  0.2525811
##     8    0.3254471  0.26984751  0.2474775
##     9    0.3276745  0.26271768  0.2494099
##    10    0.3100244  0.34842809  0.2376802
##    11    0.2843674  0.45367641  0.2195197
##    12    0.2788693  0.47051379  0.2152310
##    13    0.2760754  0.48094358  0.2139215
##    14    0.2741463  0.48724113  0.2139713
##    15    0.2740862  0.48720982  0.2148147
##    16    0.2699393  0.50301699  0.2094606
##    17    0.2710242  0.50021365  0.2106055
##    18    0.2634599  0.52282856  0.2055584
##    19    0.2605027  0.53367113  0.2038357
##    20    0.2522956  0.56158106  0.1982114
##    21    0.2421559  0.59181582  0.1900297
##    22    0.2304458  0.63203258  0.1811293
##    23    0.2269192  0.64356770  0.1783152
##    24    0.2257298  0.64671996  0.1769401
##    25    0.2248009  0.64905947  0.1764125
##    26    0.2252216  0.64797482  0.1771727
##    27    0.2222997  0.65648323  0.1749798
##    28    0.2172178  0.66969581  0.1712180
##    29    0.2168804  0.67132908  0.1714901
##    30    0.2177709  0.66838278  0.1718629
##    31    0.2179676  0.66739297  0.1726191
##    32    0.2173162  0.66945045  0.1724441
##    33    0.2155513  0.67578785  0.1708326
##    34    0.2147002  0.67910341  0.1704050
##    35    0.2143595  0.68005267  0.1703621
##    36    0.2146375  0.67957994  0.1712152
##    37    0.2156180  0.67657983  0.1719414
##    38    0.2153570  0.67653085  0.1716978
##    39    0.2125905  0.68577231  0.1691313
##    40    0.2109955  0.69034142  0.1675837
##    41    0.2105272  0.69176533  0.1674892
##    42    0.2101604  0.69366549  0.1671096
##    43    0.2093439  0.69548975  0.1660358
##    44    0.2098461  0.69544172  0.1665025
##    45    0.2097311  0.69518241  0.1667362
##    46    0.2092721  0.69679943  0.1667639
##    47    0.2098076  0.69495504  0.1671907
##    48    0.2088363  0.69643040  0.1668270
##    49    0.2080494  0.69821392  0.1663328
##    50    0.2065316  0.70228228  0.1655262
##    51    0.2052720  0.70594587  0.1641597
##    52    0.2033564  0.71103720  0.1626070
##    53    0.2008612  0.71705547  0.1605899
##    54    0.1993548  0.72218563  0.1595504
##    55    0.1978639  0.72546067  0.1586734
##    56    0.1961731  0.72953079  0.1568469
##    57    0.1954966  0.73133750  0.1563703
##    58    0.1951083  0.73165913  0.1559163
##    59    0.1954342  0.73142672  0.1563248
##    60    0.1945521  0.73407420  0.1554235
##    61    0.1942734  0.73519142  0.1550002
##    62    0.1939906  0.73620379  0.1545418
##    63    0.1933081  0.73841644  0.1537798
##    64    0.1916171  0.74240822  0.1524416
##    65    0.1901881  0.74555668  0.1514005
##    66    0.1889426  0.74859297  0.1503419
##    67    0.1880755  0.75099396  0.1504392
##    68    0.1886906  0.74943809  0.1505145
##    69    0.1880535  0.75097920  0.1499628
##    70    0.1871292  0.75272886  0.1494515
##    71    0.1876003  0.75078433  0.1497480
##    72    0.1882455  0.75013881  0.1502932
##    73    0.1881496  0.75021446  0.1502262
##    74    0.1883367  0.74925995  0.1503886
##    75    0.1886930  0.74826112  0.1506383
##    76    0.1895856  0.74589659  0.1513955
##    77    0.1904953  0.74380766  0.1521961
##    78    0.1907048  0.74361950  0.1522588
##    79    0.1915250  0.74152050  0.1525900
##    80    0.1918778  0.74104473  0.1526974
##    81    0.1920204  0.74014277  0.1525084
##    82    0.1928190  0.73854002  0.1531104
##    83    0.1928945  0.73843990  0.1532551
##    84    0.1939030  0.73704061  0.1536017
##    85    0.1950019  0.73479236  0.1542057
##    86    0.1962133  0.73236106  0.1546506
##    87    0.1964732  0.73210234  0.1548729
##    88    0.1956140  0.73485744  0.1540608
##    89    0.1947709  0.73647214  0.1535414
##    90    0.1944946  0.73706494  0.1533174
##    91    0.1944590  0.73706220  0.1536075
##    92    0.1949017  0.73601303  0.1537048
##    93    0.1946506  0.73586608  0.1539299
##    94    0.1943038  0.73786891  0.1532902
##    95    0.1932540  0.73929156  0.1524734
##    96    0.1924539  0.74126088  0.1518488
##    97    0.1915786  0.74295086  0.1503978
##    98    0.1910325  0.74418777  0.1501936
##    99    0.1901908  0.74492848  0.1495967
##   100    0.1891394  0.74752588  0.1482984
##   101    0.1887705  0.74918983  0.1479136
##   102    0.1876020  0.75186966  0.1470857
##   103    0.1867340  0.75354835  0.1460330
##   104    0.1857673  0.75506678  0.1455247
##   105    0.1850678  0.75622854  0.1453741
##   106    0.1845792  0.75722855  0.1447592
##   107    0.1842967  0.75816293  0.1442254
##   108    0.1842048  0.75869027  0.1438248
##   109    0.1845814  0.75833507  0.1440437
##   110    0.1846961  0.75835770  0.1445321
##   111    0.1843805  0.75910553  0.1445875
##   112    0.1836718  0.76089782  0.1441977
##   113    0.1833544  0.76155665  0.1439405
##   114    0.1839281  0.76063788  0.1445901
##   115    0.1846878  0.75873888  0.1451782
##   116    0.1854086  0.75678404  0.1454325
##   117    0.1858427  0.75570013  0.1454108
##   118    0.1867552  0.75341325  0.1458435
##   119    0.1871573  0.75303129  0.1462695
##   120    0.1875917  0.75199187  0.1467771
##   121    0.1877231  0.75217913  0.1471695
##   122    0.1883204  0.75089285  0.1477259
##   123    0.1894988  0.74841938  0.1489248
##   124    0.1902347  0.74678090  0.1488453
##   125    0.1911159  0.74503179  0.1496036
##   126    0.1922176  0.74141038  0.1504216
##   127    0.1933499  0.73820696  0.1513061
##   128    0.1941851  0.73563057  0.1517562
##   129    0.1957510  0.73219968  0.1529509
##   130    0.1961765  0.73104096  0.1533076
##   131    0.1954730  0.73321467  0.1522365
##   132    0.1953622  0.73351761  0.1518672
##   133    0.1956447  0.73368890  0.1524046
##   134    0.1944719  0.73734711  0.1515091
##   135    0.1939982  0.73958598  0.1511059
##   136    0.1924691  0.74371782  0.1499603
##   137    0.1899712  0.74943413  0.1478741
##   138    0.1885253  0.75250164  0.1465785
##   139    0.1873816  0.75551123  0.1457448
##   140    0.1862555  0.75810127  0.1452686
##   141    0.1855026  0.76039318  0.1447727
##   142    0.1845431  0.76258935  0.1440838
##   143    0.1841034  0.76308439  0.1438683
##   144    0.1829719  0.76622920  0.1428369
##   145    0.1827408  0.76694890  0.1423945
##   146    0.1817672  0.76899347  0.1421126
##   147    0.1811207  0.77098129  0.1413597
##   148    0.1810381  0.77086067  0.1411490
##   149    0.1800398  0.77265473  0.1404799
##   150    0.1799469  0.77319657  0.1405296
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 150.
cv_pcr_plot <-  ggplot(data = crp_pca$results, aes(x = ncomp, y = RMSE, group=1)) +
                ggtitle("Principal Component Regression Cross-Validation Results") +
                theme(plot.title = element_text(hjust = 0.5)) +
                xlab("Number of Components") + 
                ylab("RMSE (Repeated Cross-Validation)") + 
                geom_line(color="blue")+
                geom_point()

cv_pcr_plot
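Since caret’s pcr model is a pls-package fit underneath, we can also ask how much of the predictors’ variance the retained components capture; a sketch, assuming the crp_pca object above:

```r
# Percentage of predictor variance explained by each principal component
# (explvar comes from the pls package that backs method = "pcr")
cum_var <- cumsum(explvar(crp_pca$finalModel))

# Cumulative variance captured after 50, 100 and all 150 retained components
cum_var[c(50, 100, 150)]
```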

vip(crp_pca$finalModel, method = "model", num_features = 125)
## Warning: package 'pls' was built under R version 4.3.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings

With the chosen ncomp = 150, half of the 300 predictors and the largest value in our grid, the cross-validated \(R^2\) reaches about 0.77, comparable to Lasso, so PCR needs a large number of components to explain crp well.

Decision Tree and Random Forest

Decision Trees are among the most commonly used algorithms in machine learning for regression and classification. They work by dividing the predictor space \(X_1, X_2, \dots, X_p\) into \(J\) distinct and non-overlapping regions \(R_1, R_2, \dots, R_J\); then, for each test observation that falls in a region \(R_j\), we make the same prediction, namely the average response of all the training observations that fall in that region. Ideally we would like to divide the space into the regions that minimize the Residual Sum of Squares

\[ RSS = \sum^J _ {j=1}\sum _ {i \in R_j}(y_i - \hat{y}_{R_j})^2 \]

where \(\hat{y}_{R_j}\) is the average response of the partition. The problem is that this approach is computationally infeasible because it would need to try all possible partitions, so the space is actually partitioned through recursive binary splitting, a top-down and greedy algorithm. The algorithm starts from the top of the tree and performs successive binary splits, each producing two new lower branches; it is greedy because at each step the best split is chosen considering only the current step instead of looking ahead for better splits. Once we have our regions \(R_1, \dots, R_J\) we predict their response by taking the average of the training observations falling in each region. This procedure may be good at fitting the training data, in fact too good, because over-fitting is a serious issue of this approach: the final tree may be extremely complex, so our aim is to prune the tree in order to obtain a smaller one that is more effective on unseen data. The most intuitive approach would be to cross-validate the error of each sub-tree, but this is nearly impossible in terms of computational cost, so we use cost complexity pruning: we build a full tree that stops growing when a terminal leaf has fewer than a set number of observations, then we find the sub-tree that minimizes the following function

\[ \sum^{|T|} _ {m=1}\sum _ {i: x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha|T| \]

where \(\alpha\) is a tuning hyperparameter that controls the cost of complexity of the tree and \(|T|\) is the number of terminal leaves of a specific sub-tree; in this way we can use cross-validation to find the optimal value of \(\alpha\). This formulation is indeed very similar to the Lasso regularization discussed before, and in the same way, if \(\alpha = 0\) the result is the unpruned tree. Let’s now fit a regression tree and look for the optimal value of \(\alpha\), here called cp, which stands for cost complexity.
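The pruning machinery that caret wraps can also be used directly; a minimal self-contained illustration with rpart on the built-in mtcars data (not our expression matrix):

```r
library(rpart)
set.seed(123)

# Grow a deliberately deep tree by setting cp = 0 (no complexity penalty)
full_tree <- rpart(mpg ~ ., data = mtcars,
                   control = rpart.control(cp = 0, minsplit = 5))

# The cp table reports the cross-validated error for each candidate alpha
printcp(full_tree)

# Prune back: keep only the splits that improve the fit by at least cp
pruned_tree <- prune(full_tree, cp = 0.05)
```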

set.seed(123)

# Setup tuning grid for Regression Trees
tune_grid_tree <-  expand.grid(cp = seq(from = 0, to = 0.1, by = 0.01))

# Training Regression Tree
crp_tree <- train(crp ~., 
                  data = crp_full,
                  method = "rpart",
                  trControl = fit_control,
                  tuneGrid = tune_grid_tree)
crp_tree
## CART 
## 
## 188 samples
## 300 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 170, 168, 169, 171, 169, 169, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared   MAE      
##   0.00  0.3371162  0.3225425  0.2489654
##   0.01  0.3384493  0.3151143  0.2506049
##   0.02  0.3354561  0.3116806  0.2495162
##   0.03  0.3397956  0.2900194  0.2532579
##   0.04  0.3423148  0.2770519  0.2557495
##   0.05  0.3436890  0.2710290  0.2583767
##   0.06  0.3439616  0.2589230  0.2594738
##   0.07  0.3490052  0.2331864  0.2655736
##   0.08  0.3567864  0.2048770  0.2721232
##   0.09  0.3632212  0.1839017  0.2779697
##   0.10  0.3628084  0.1842195  0.2778437
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02.

We got a final cp value of 0.02, which is indeed the value with the lowest RMSE, as you can see from the plot below; the issue is that the result is somewhat unsatisfying if we look at the \(R^2\) of the model.

cv_tree_plot <-  ggplot(data = crp_tree$results, aes(x = cp, y = RMSE, group=1)) +
                 ggtitle("Regression Tree Cross-Validation Results") +
                 theme(plot.title = element_text(hjust = 0.5)) +
                 xlab("Cost Complexity") + 
                 ylab("RMSE (Repeated Cross-Validation)") + 
                 geom_line(color="blue")+
                 geom_point()

cv_tree_plot

Let’s now look at the divisions made during the procedure.

crp_tree$finalModel
## n= 188 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 188 26.3050300 10.083130  
##    2) ascB< 3.988938 56  4.1376740  9.830751  
##      4) caiA< 4.309405 45  2.4568180  9.761415  
##        8) cstA< 7.036434 19  0.6222862  9.585754 *
##        9) cstA>=7.036434 26  0.8198101  9.889784  
##         18) nanT>=3.012752 14  0.1411332  9.755867 *
##         19) nanT< 3.012752 12  0.1346890 10.046020 *
##      5) caiA>=4.309405 11  0.5795073 10.114400 *
##    3) ascB>=3.988938 132 17.0872500 10.190200  
##      6) nanK>=4.960536 44  3.7024490  9.917772  
##       12) sucD< 9.377993 10  0.7836589  9.530826 *
##       13) sucD>=9.377993 34  0.9811461 10.031580 *
##      7) nanK< 4.960536 88  8.4865650 10.326410  
##       14) ilvN>=7.155852 51  2.3343370 10.191940  
##         28) glpQ< 5.69012 14  0.6191667  9.992103 *
##         29) glpQ>=5.69012 37  0.9445091 10.267560 *
##       15) ilvN< 7.155852 37  3.9590040 10.511760  
##         30) caiC>=2.81531 30  1.2367910 10.384250  
##           60) cspE>=11.1791 23  0.5808969 10.309830 *
##           61) cspE< 11.1791 7  0.1100455 10.628760 *
##         31) caiC< 2.81531 7  0.1441127 11.058220 *

It is pretty hard to understand what is happening here, so we can visualize the tree in a nicer way using the rattle package in order to obtain a more human-readable result.

fancyRpartPlot(crp_tree$finalModel, sub = "", palettes = "Oranges")

As you can see, the root node starts by predicting the expression value of crp as 10.083130, then each split refines the predicted value. The first split looks at the expression of the gene ascB, which encodes a phospho-β-glucosidase that hydrolyzes the bond between D-glucose and D-glucose 6-phosphate. The latter is the product of the first reaction of glycolysis, which is the starting point of the oxidation of glucose for the production of ATP. As you can see, ascB is regulated by CRP when cAMP binds CRP and activates it. If we get more D-glucose 6-phosphate in the system we have more ATP and less cAMP, which means that the activity of CRP should be lower. One important feature of CRP is that it negatively autoregulates itself, so it makes sense that for lower levels of ascB we observe lower expression values of crp: there is less D-glucose 6-phosphate and more cAMP in the cell, which makes CRP more active, and active CRP represses its own transcription.

Even if this first decision makes sense, it is probably better to try a more robust method, because decision trees are very easy to explain but not very accurate. Random Forest is a more sophisticated version of decision trees: the rationale is to build many different decision trees (by default caret builds 500 trees) and take as the final prediction their average response. Each tree is grown on a different sample of the data and, at every split, may only use a random subset of the predictors of size \(m\), where the most common choice is \(m = \sqrt{p}\); in this way we obtain many trees whose splits use only a small portion of the predictors, which decorrelates the trees, and finally the responses of this “random forest” are averaged.

set.seed(123)

# Setup tuning grid for Random Forest, m value is sqrt of the predictors.
tune_grid_forest <-  expand.grid(.mtry = sqrt(ncol(crp_full) - 1))

# Training Random Forest
crp_forest <- train(crp ~., 
                  data = crp_full,
                  method = "rf",
                  trControl = fit_control,
                  tuneGrid = tune_grid_forest)
crp_forest
## Random Forest 
## 
## 188 samples
## 300 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 170, 168, 169, 171, 169, 169, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2409224  0.7300424  0.1784262
## 
## Tuning parameter 'mtry' was held constant at a value of 17.32051

As you can see, the cross-validation \(R^2\) and RMSE of this model are fully comparable to the linear regression methods we used before, and random forest is known to be highly accurate and powerful, although, unlike linear regression, it is not easy to interpret directly. To look at the results we could sample a random tree out of the forest and visualize it as before, but a single tree may suffer from the same bias as before, so it is better to look at the distribution of the minimal depth, that is the distance between the root of a tree and the node/split where a given gene is first used.

# compute all the minimal depths of each gene
min_depth_frame <- min_depth_distribution(crp_forest$finalModel)

# Plot the results
plot_min_depth_distribution(min_depth_frame, k = 15)

This beautiful plot has on the x-axis the number of trees and on the y-axis the genes we are considering; each color represents a depth value, and the length of the colored box tells how many trees have that depth value for the considered gene. The plot also reports the mean minimal depth of each gene across all the trees. We chose to show only the top 15 genes by mean minimal depth, because a lower value of mean minimal depth indicates a more important gene for the model. As you can see, many of the target genes we discussed before, like rpsF, gatY and ascB, appear here, so we now have multiple confirmations that these genes can be used as a proxy for the activity of crp. In conclusion, random forest is a very reliable model for regulators that have a high number of interactions, but it may not be as good for regulators with a lower number of target genes, because we would have far fewer options for tree splitting.
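As a cross-check of the minimal-depth ranking, the impurity-based importance stored in the underlying randomForest object tells a similar story; a sketch, assuming the crp_forest object trained above:

```r
# Total decrease in node impurity (residual sum of squares) per gene,
# summed over all trees of the forest
imp <- importance(crp_forest$finalModel)

# Top 10 genes by IncNodePurity (higher = more important)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE], 10)
```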